Introduction

I used the words “blue butterfly” to search for photos on pixabay.com, because I like the look of blue butterflies.

Screenshot of the first few rows of royalty-free photos

Before I started exploring the photo data, I noticed that most of the photos showed a butterfly that had landed on a plant, probably because it’s difficult to photograph a moving butterfly. The photos I clicked on tended to have around 100-300 likes, apart from one popular photo with over 1000 likes. I also noticed that common tags included butterfly, nature, insect, and blue. The “butterfly” and “blue” tags were likely how those images appeared in the search results, but “nature” and “insect” were more interesting to see.

Here are the URLs of my selected photos:

photo_data %>%
  select(pageURL) %>%
  knitr::kable()
pageURL
https://pixabay.com/photos/butterfly-insect-animal-142506/
https://pixabay.com/photos/butterfly-insect-nature-common-blue-95364/
https://pixabay.com/photos/butterfly-flower-pollinate-1611794/
https://pixabay.com/photos/butterfly-blue-wings-flight-insect-2837589/
https://pixabay.com/photos/blue-morpho-butterfly-flower-784979/
https://pixabay.com/photos/butterfly-insect-animal-blue-morpho-165104/
https://pixabay.com/photos/butterfly-insect-nature-common-blue-95358/
https://pixabay.com/photos/mazarine-blue-butterfly-flower-6405362/
https://pixabay.com/photos/butterfly-common-blue-insects-plant-7366626/
https://pixabay.com/photos/butterfly-insect-nature-common-blue-95363/
https://pixabay.com/photos/mazarine-blue-butterfly-butterfly-6400060/
https://pixabay.com/photos/butterfly-common-blue-insect-7347546/
https://pixabay.com/photos/common-blue-butterfly-butterfly-5163066/
https://pixabay.com/photos/butterfly-blue-butterfly-7947062/
https://pixabay.com/photos/common-blue-butterfly-insect-leaves-2733287/
https://pixabay.com/photos/mazarine-blue-butterfly-flower-7753988/
https://pixabay.com/photos/pollination-butterfly-bud-5541489/
https://pixabay.com/photos/butterfly-flowers-pollinate-6624801/
https://pixabay.com/photos/butterfly-insect-winged-insect-6807529/
https://pixabay.com/photos/hauhechel-blue-butterfly-flower-6126462/
https://pixabay.com/photos/butterfly-insect-flowers-7984538/
https://pixabay.com/photos/common-blue-polyommatus-icarus-5398789/
https://pixabay.com/photos/butterfly-butterflies-nature-insect-4882217/
https://pixabay.com/photos/silver-studded-blue-goldenrod-7489247/
https://pixabay.com/photos/butterfly-insect-butterflies-nature-5331647/
https://pixabay.com/photos/common-blue-butterfly-butterfly-571751/
https://pixabay.com/photos/bluebell-flowers-butterfly-orchid-1314391/
https://pixabay.com/photos/butterfly-common-blue-insect-grass-7374393/
https://pixabay.com/photos/butterfly-flowers-to-blue-colorful-4745902/
https://pixabay.com/photos/tropical-fish-tank-ocean-4395272/
https://pixabay.com/photos/butterflies-blue-insect-isolated-2649292/
https://pixabay.com/photos/cat-butterfly-kitten-red-mackerel-4277400/
https://pixabay.com/photos/picture-painting-to-paint-painted-1635747/
https://pixabay.com/photos/butterfly-blue-nature-blue-wing-2436395/
https://pixabay.com/photos/butterfly-exotic-south-america-176133/
https://pixabay.com/photos/blue-morpho-butterfly-south-america-4435579/
https://pixabay.com/photos/woman-female-beauty-young-girl-2814821/
https://pixabay.com/photos/fantasy-book-cover-2899517/
https://pixabay.com/photos/blue-morpho-butterfly-wildlife-1674100/
https://pixabay.com/photos/meadow-green-meadow-flower-meadow-2401911/
https://pixabay.com/photos/butterfly-flowers-nature-forest-7498779/
https://pixabay.com/photos/blue-morpho-butterfly-butterfly-7007884/
https://pixabay.com/photos/flower-butterfly-fantasy-spring-7022112/
https://pixabay.com/photos/flower-butterfly-spring-fantasy-7111468/
https://pixabay.com/photos/smoke-background-abstract-whirl-2467437/
https://pixabay.com/photos/animal-beautiful-blue-butterfly-20439/
https://pixabay.com/photos/phalaenopsis-orchid-colored-blue-1858681/
https://pixabay.com/photos/phalaenopsis-orchid-colored-blue-1858684/
https://pixabay.com/photos/butterfly-exotic-south-america-200280/
https://pixabay.com/photos/meadow-green-meadow-flower-meadow-2401931/
https://pixabay.com/photos/bluebell-flower-blossom-bloom-blue-115283/
https://pixabay.com/photos/insect-blue-flower-good-park-1154749/

And here is a GIF of my selected photos:

GIF of the preview thumbnails

Key features of selected photos

# Summary values
median_collections <- median(photo_data$collections, na.rm = TRUE)
mean_collections <- mean(photo_data$collections, na.rm = TRUE)

median_comments <- median(photo_data$comments, na.rm = TRUE)
mean_comments <- mean(photo_data$comments, na.rm = TRUE)

tagged_flower_count <- sum(photo_data$tagged_flower == "Yes", na.rm = TRUE)

# Grouped data and summary values
selected_photos_grouped <- photo_data %>%
  select(tagged_flower,
         likes_per_thousand_views,
         downloads_per_thousand_views) %>%
  group_by(tagged_flower) %>%
  summarise(median_likes = median(likes_per_thousand_views),
            median_downloads = median(downloads_per_thousand_views))

I selected photos based on having more than 15000 views, so that we could look at the features of some more popular photos.
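The 15000-view cutoff came from looking at the upper quartile of views (see the appendix). Here is a minimal sketch of how such a cutoff can be found, using a made-up `views` vector rather than the real Pixabay data; the names `upper_quartile` and `popular_views` are only for illustration.

```r
# Hypothetical view counts, for illustration only (not the real Pixabay data)
views <- c(1200, 3400, 5600, 8000, 9800, 12000, 15500, 21000, 40000, 95000)

# The upper quartile (75th percentile) is one candidate cutoff for "popular"
upper_quartile <- quantile(views, probs = 0.75, names = FALSE)

# Keep only the view counts above the cutoff
popular_views <- views[views > upper_quartile]
```

In the real data I rounded the cutoff to a tidy number rather than using the exact quartile.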

The median number of collections my selected photos were in was 148.5, while the mean was 229.4. The median number of comments on my selected photos was 36, while the mean was 48.5.

The number of photos in my selection with the tag “flower” or “flowers” is stored in tagged_flower_count, which is calculated with the summary values above.

Creativity

# Yay, plots

## Likes plots
likes_by_tag_comments <- photo_data %>%
  ggplot(aes(x = comments,
             y = likes_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2() +
  labs(title = "Likes per thousand views by comments and collections",
       x = "Number of comments",
       y = "Likes/1000 views",
       color = "'Flower'/'flowers' tag")

likes_by_tag_collections <- photo_data %>%
  ggplot(aes(x = collections, # x-axis is labelled "Number of collections"
             y = likes_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2() +
  labs(x = "Number of collections",
       y = "Likes/1000 views",
       color = "'Flower'/'flowers' tag",
       caption = "Source: Pixabay")

likes_plot <- likes_by_tag_comments / likes_by_tag_collections
likes_plot

## Downloads plots
downloads_by_tag_comments <- photo_data %>%
  ggplot(aes(x = comments,
             y = downloads_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2() +
  labs(title = "Downloads per thousand views by comments and collections",
       x = "Number of comments",
       y = "Downloads/1000 views",
       color = "'Flower'/'flowers' tag")

downloads_by_tag_collections <- photo_data %>%
  ggplot(aes(x = collections,
             y = downloads_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2() +
  labs(x = "Number of collections",
       y = "Downloads/1000 views",
       color = "'Flower'/'flowers' tag",
       caption = "Source: Pixabay")

downloads_plot <- downloads_by_tag_comments / downloads_by_tag_collections
downloads_plot

To demonstrate creativity, I’ve tried to create some informative plots about how a few selected factors are related to a photo’s quality metrics. I’ve also tried to keep in mind the context of a statistical investigation.

I decided that the “quality metrics” here are the likes and downloads per 1000 views. Likes suggest that a photo succeeded as art that people admire, and downloads suggest that it was useful to people. I used rates instead of raw likes and downloads because photos with more views most likely have more likes and downloads (though I could have quickly plotted that relationship to check it first).
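As a rough sketch of the idea, the rate is just the count divided by views and scaled to 1000, and the assumption that likes rise with views can be checked with a correlation. The `photos` data frame below is invented for illustration and is not the Pixabay data.

```r
# Hypothetical photos, for illustration only (not the real Pixabay data)
photos <- data.frame(
  views = c(20000, 50000, 100000),
  likes = c(100, 200, 600)
)

# Rate per thousand views, so photos with different view counts are comparable
photos$likes_per_thousand_views <- photos$likes / photos$views * 1000

# Check whether raw likes rise with views, which is the reason for using a rate
views_likes_correlation <- cor(photos$views, photos$likes)
```

In this made-up example, the photo with the fewest raw likes does not have the lowest rate, which is the kind of difference the rate is meant to show.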

The factors I thought might affect these metrics are comments, collections, and whether or not the photo has the word “flower” (or “flowers” in some cases) in its tags. Comments and collections are obvious signs of user engagement. I also wondered whether photos carrying another popular-seeming tag, one that people might search for by itself, would differ in popularity from photos without that tag.
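The tag check can be sketched as below. The tag strings are made up, and I'm assuming the tags arrive as one comma-separated string per photo. One thing the sketch shows is that matching on "flower" is a substring match, so a tag like "sunflower" would also count.

```r
library(stringr)

# Hypothetical comma-separated tag strings, for illustration only
tags <- c("butterfly, flower, nature",
          "butterfly, flowers, pollinate",
          "butterfly, insect, blue",
          "sunflower, field, summer")

# "flower" is matched as a substring, so "flowers" and "sunflower" also count
tagged_flower <- ifelse(str_detect(tags, "flower"), "Yes", "No")
```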

The plots are meant to answer a question like “how do we get a higher ratio of likes and downloads on our Pixabay photos?”, considering methods such as getting people to comment, getting people to save photos to collections, and tagging the photo with another popular tag. What I mostly notice is that there doesn’t seem to be much difference between photos with “flower” in their tags and photos without, and that the scatter increases greatly as the number of comments or collections increases. There’s plenty of room for further investigation, though.

Learning reflection

An important idea I’ve learned from this module is what JSON is and how it’s used. Most of the statistics I’ve done so far has involved data given in CSV format, but I’ve learned that JSON is a more commonly used way to store data on the web (I assume because JavaScript is so popular for building websites). It was interesting to see how the format and syntax differed from the literally rectangular data I’m used to seeing, and how the data could be requested from APIs. It’s stuff I’d vaguely heard about but never properly learned before.
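To illustrate the difference in shape, here is a small sketch of nested JSON being turned into a rectangular data frame. The JSON text is invented and only loosely shaped like an API response; the field names are assumptions for the example.

```r
library(jsonlite)

# A small made-up JSON response, for illustration only
json_text <- '{
  "total": 2,
  "hits": [
    {"id": 1, "tags": "butterfly, blue", "likes": 120},
    {"id": 2, "tags": "butterfly, flower", "likes": 300}
  ]
}'

# fromJSON() simplifies an array of similar objects into a data frame
parsed <- fromJSON(json_text)
hits <- parsed$hits
```

Here `parsed` is a list with a single number and a nested data frame, while `hits` is the rectangular part with one row per photo.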

I would be curious to further explore how to manipulate data, as we’ve done with filter() and group_by() in this module. I’ve heard that cleaning and preparing data is a big part of statistics, whereas most data I’ve worked with for school has already been prepared for me. I’ve experienced a bit of data manipulation already in this course, especially in this project and the last, but I still want to learn more skills.
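As a reminder of how these functions fit together, here is a minimal sketch that filters, adds a rate, and then summarises by group. The `photos` tibble and the `n_photos` column are made up for illustration.

```r
library(dplyr)

# Hypothetical photo data, for illustration only
photos <- tibble(
  tagged_flower = c("Yes", "No", "Yes", "No", "Yes"),
  views = c(20000, 8000, 30000, 16000, 50000),
  likes = c(100, 90, 120, 48, 400)
)

flower_summary <- photos %>%
  filter(views > 15000) %>% # keep the more-viewed photos
  mutate(likes_per_thousand_views = likes / views * 1000) %>%
  group_by(tagged_flower) %>% # one group per tag value
  summarise(median_likes = median(likes_per_thousand_views),
            n_photos = n())
```

The result has one row per group, so it is easy to compare the "Yes" and "No" photos side by side.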

Appendix

library(tidyverse)
library(jsonlite)
library(magick)
library(patchwork)
library(ggthemes)

# fromJSON() can read a file path directly, so readLines() isn't needed
json_data <- fromJSON("pixabay_data.json")
pixabay_photo_data <- json_data$hits

# Have a look at what's in the dataframe
View(pixabay_photo_data)
summary(pixabay_photo_data$views) # To figure out the upper quartile of views

# The required data manipulation
selected_photos <- pixabay_photo_data %>%
  filter(views > 15000) %>% # Upper quartile - reduces observations to about 50
  mutate(
    tagged_flower = ifelse(str_detect(tags, "flower"), "Yes", "No"),
    likes_per_thousand_views = likes / views * 1000,
    downloads_per_thousand_views = downloads / views * 1000
  ) %>%
  select(
    # Seems interesting to see what user-related metrics correlate with likes/downloads
    pageURL, previewURL, tagged_flower, collections, comments,
    downloads_per_thousand_views, likes_per_thousand_views
  )

View(selected_photos)
summary(selected_photos)
write_csv(selected_photos, "selected_photos.csv")

# Looking for interesting summary values
summary(selected_photos)

# Summary values
median_collections <- median(selected_photos$collections, na.rm = TRUE)
mean_collections <- mean(selected_photos$collections, na.rm = TRUE)

median_comments <- median(selected_photos$comments, na.rm = TRUE)
mean_comments <- mean(selected_photos$comments, na.rm = TRUE)

tagged_flower_count <- sum(selected_photos$tagged_flower == "Yes", na.rm = TRUE)

# Grouped data and summary values
selected_photos_grouped <- selected_photos %>%
  select(tagged_flower,
         likes_per_thousand_views,
         downloads_per_thousand_views) %>%
  group_by(tagged_flower) %>%
  summarise(median_likes = median(likes_per_thousand_views),
            median_downloads = median(downloads_per_thousand_views))

View(selected_photos_grouped)

# Creating and writing a GIF of the images
my_photos_gif <- image_read(selected_photos$previewURL) %>%
  image_scale(geometry_area(width = 150)) %>%
  image_animate(fps = 2)

my_photos_gif

image_write(my_photos_gif, "my_photos.gif")

# Yay, plots

## Likes plots
likes_by_tag_comments <- selected_photos %>%
  ggplot(aes(x = comments,
             y = likes_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2(light = FALSE) +
  labs(title = "Likes per thousand views by comments and collections",
       x = "Number of comments",
       y = "Likes/1000 views",
       color = "'Flower'/'flowers' tag")

likes_by_tag_collections <- selected_photos %>%
  ggplot(aes(x = collections, # x-axis is labelled "Number of collections"
             y = likes_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2(light = FALSE) +
  labs(x = "Number of collections",
       y = "Likes/1000 views",
       color = "'Flower'/'flowers' tag",
       caption = "Source: Pixabay")

likes_plot <- likes_by_tag_comments / likes_by_tag_collections
likes_plot

## Downloads plots
downloads_by_tag_comments <- selected_photos %>%
  ggplot(aes(x = comments,
             y = downloads_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2(light = FALSE) +
  labs(title = "Downloads per thousand views by comments and collections",
       x = "Number of comments",
       y = "Downloads/1000 views",
       color = "'Flower'/'flowers' tag")

downloads_by_tag_collections <- selected_photos %>%
  ggplot(aes(x = collections,
             y = downloads_per_thousand_views,
             color = tagged_flower)) +
  geom_point() +
  theme_solarized_2(light = FALSE) +
  labs(x = "Number of collections",
       y = "Downloads/1000 views",
       color = "'Flower'/'flowers' tag",
       caption = "Source: Pixabay")

downloads_plot <- downloads_by_tag_comments / downloads_by_tag_collections
downloads_plot